Grades 10–11 | Week 1: How LLMs Work & Why They Fail

How LLMs Work & Why They Fail

Week 1 of 3 — AI Course • Grades 10–11

50 Minutes | Theory + Concepts + Discussion

Week 1: Theory ← You are here
Week 2: Hands-On Activities
Week 3: Exam

Today's Lesson

1. What is an LLM? — The big picture (8 min)
2. Tokens & Training — How LLMs actually work (12 min)
3. Context Windows — AI's short-term memory (8 min)
4. Why LLMs Fail — Hallucinations & other limits (15 min)
5. Discussion & Exit Ticket (7 min)

Before We Start...

Opening Question

Who has used ChatGPT, Claude, or Gemini? What did you use it for?

And — has anyone ever seen it give a wrong or weird answer?

By the end of today you'll understand exactly why those wrong answers happen — and be able to explain it to someone else.

Large Language Model

An AI trained on massive amounts of text to predict what word (or token) comes next

💡 Think: autocomplete on your phone — but trained on billions of pages of text and vastly more sophisticated

ChatGPT

by OpenAI

Claude

by Anthropic

Gemini

by Google

Llama

by Meta (open source)

All of these work on the same core principle.

Tokens: The Building Blocks

LLMs don't read words like we do.
They read tokens.

1 token ≈ ¾ of a word, or about 4 characters

Tokens in Practice

"Hello world"

= 2 tokens

"Unbelievable"

= 3 tokens
un · believ · able

"I love pizza"

= 4 tokens

"Artificial intelligence"

≈ 5–6 tokens

Quick Activity — Pair up

Guess the token count: "The quick brown fox jumps over the lazy dog"

Answer: ~10 tokens. Why does this matter? Every model has a token limit.
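The "1 token ≈ 4 characters" rule of thumb above can be turned into a tiny estimator. This is a rough heuristic only, not a real tokenizer — actual models (via libraries like OpenAI's tiktoken) split text by learned subword rules, so exact counts will differ slightly:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters per token rule of thumb.
    Real tokenizers give exact (and slightly different) counts."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Hello world"))                                  # 3 (a real tokenizer says 2)
print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 11
```

Close enough for budgeting, which is all the heuristic is for.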

Why Tokens Matter

Think About It

If a model costs $0.01 per 1,000 tokens, and your app sends 500 messages a day averaging 200 tokens each — how much does that cost per month?

Answer: 500 × 200 = 100,000 tokens/day × 30 = 3M tokens = $30/month. Real product design involves this kind of thinking.
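The cost arithmetic above can be laid out as a few lines of code. The price of $0.01 per 1,000 tokens is the slide's assumed figure, not any provider's real rate:

```python
# Monthly cost estimate using the slide's assumed numbers.
price_per_1k_tokens = 0.01   # dollars (assumed rate from the exercise)
messages_per_day = 500
tokens_per_message = 200
days_per_month = 30

tokens_per_month = messages_per_day * tokens_per_message * days_per_month
monthly_cost = tokens_per_month / 1000 * price_per_1k_tokens

print(f"{tokens_per_month:,} tokens -> ${monthly_cost:.2f}/month")
# 3,000,000 tokens -> $30.00/month
```

Swapping in a provider's real per-token price and your own traffic numbers gives a first-pass budget for any LLM-backed app.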

How LLMs Are Trained

1. Collect Data — books, the web, code, Wikipedia, Reddit…
2. Learn Patterns — predict the next token, billions of times
3. Fine-Tune — specialise for a specific task

Class Question

"If we train an LLM on internet data — what problems could that cause?"

Expected: bias, misinformation, offensive content, outdated information, underrepresentation of some languages

What "Learning Patterns" Actually Means

"The cat sat on the ___"

The model guesses: mat, floor, chair, table

It gets feedback on which predictions are statistically likely.

It repeats this billions of times across billions of sentences.

⚠️ Key Point

The model is learning statistical patterns, not facts about the world. This distinction explains almost every limitation we'll cover today.
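The "statistical patterns, not facts" idea can be shown with a deliberately tiny toy: count which word follows which in a few sentences, then "predict" by picking the most common continuation. This is nothing like a real LLM's neural network — it is only meant to make the statistics-not-knowledge point concrete:

```python
from collections import Counter, defaultdict

# Toy corpus — the "training data" for our miniature predictor.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the chair",
    "the dog sat on the mat",
]

# Count which word follows each word across all sentences.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        next_word_counts[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the statistically most common continuation seen in training."""
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("sat"))  # 'on' — it follows "sat" in every training sentence
```

Note the predictor has no idea what a cat or a mat *is*; it only knows which words tend to co-occur. Ask it about a word it has never seen and it fails outright — a real LLM in the same situation produces a plausible-sounding guess instead, which is exactly where hallucinations come from.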

Fine-Tuning

After basic training, companies specialise the model for specific tasks or behaviours:

ChatGPT / Claude

Fine-tuned to be helpful, harmless, and honest in conversation

GitHub Copilot

Fine-tuned specifically for writing and explaining code

Medical LLMs

Fine-tuned on clinical notes, research papers, patient data

Your project

You'd fine-tune (or prompt-engineer) for your specific use case

Context Windows

🧠 Imagine a friend who can only remember the last 10 sentences you said — everything before that is gone.

Context Window — In Practice

Example: You have a 10,000-word tutoring session with a model that has an 8,000-word limit.

→ The model silently forgets the first 2,000 words.

→ If you refer back to something from the start, the model has no idea what you mean.

Design Challenge — 3 minutes

You're building an AI study buddy. A student uses it for 2 hours. How do you stop the model losing important context?

Ideas to draw out: summarise key facts at intervals, save student name/topic/goals separately, prompt the model to recap periodically
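The simplest (and weakest) strategy for the design challenge above is exactly what models do by default: keep only the most recent messages that fit the window and silently drop the rest. A minimal sketch, counting words instead of tokens for simplicity (names and numbers here are illustrative):

```python
def fit_context(messages: list[str], max_words: int) -> list[str]:
    """Keep only the newest messages that fit within max_words."""
    kept, total = [], 0
    for msg in reversed(messages):          # walk from newest to oldest
        n = len(msg.split())
        if total + n > max_words:
            break                           # older messages are silently dropped
        kept.append(msg)
        total += n
    return list(reversed(kept))             # restore chronological order

chat = ["my name is Sara", "explain photosynthesis", "now quiz me on it"]
print(fit_context(chat, 8))
# ['explain photosynthesis', 'now quiz me on it']  — the student's name is gone
```

The output shows the failure mode: the oldest message (the student's name) falls out of the window first, which is why the better designs the class should propose involve summarising or storing key facts outside the window.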

Hallucinations

When AI confidently states false information

⚠️ Most Important Point Today

The AI is not lying. It has no concept of truth. It is always doing the same thing — predicting the most plausible next token. Sometimes that token is wrong.

Real Example 1

Lawyers submitted a brief with completely made-up court cases that ChatGPT generated

Real Example 2

Students receive realistic-looking but entirely fabricated academic citations

3 Causes of Hallucinations

1. Pattern matching without understanding

The LLM sees "[Country]'s capital is [City]" enough times. Ask about a made-up country — it invents a plausible city.

2. Gaps in training data

If the event, place, or person isn't well represented in the training data, the model guesses from similar things it does know.

3. Ambiguous prompts

"Tell me about the Paris incident" — which one? The model assumes and presents assumptions as facts.

Spot the Hallucination 🎯

Class Activity

I'll show you 3 statements. For each: Real fact or Hallucination? Discuss with a partner for 30 seconds, then we vote.

Statement 1: "The Eiffel Tower was built between 1887 and 1889." → ✓ Real

Statement 2: "The Great Wall of China is visible from space with the naked eye." → ✗ Myth — widely repeated, so LLMs repeat it too

Statement 3: "Einstein won the Nobel Prize for his theory of relativity." → ⚠️ Partially wrong — he won it for the photoelectric effect. Most dangerous type of hallucination.

Other Key Limitations

📅 No real-time info
Training has a cutoff date. No live news, weather, prices, or recent events without external tools.

🤖 No true understanding
Pattern recognition ≠ comprehension. No lived experience, emotions, or common sense reasoning.

⚖️ Biased outputs
Reflects biases in training data — gender, culture, and language. English hugely overrepresented vs other languages.

🧠 Context window limits
Forgets long conversations. Critical to design around if building real products.

Common Misconceptions

❌ Wrong ideas
  • "LLMs lie when they hallucinate"
  • "LLMs understand language like humans"
  • "More training data = no hallucinations"
  • "The AI remembers our past chats"
✓ Correct understanding
  • LLMs predict plausible text — they don't know truth
  • Pattern recognition ≠ comprehension
  • Gaps + ambiguous prompts still cause hallucinations
  • Context window = short-term only, resets each session

A Question Closer to Home

Class Discussion

"Why might an LLM perform significantly better in English than in Arabic?"

Answer: The internet has vastly more English-language content than Arabic. The model learned from what existed — so its English patterns are far richer, more nuanced, and more accurate than its Arabic ones.

Bigger implication

This means AI tools built on these models may serve Arabic speakers worse — a real equity issue in AI development.

Week 1 — Everything in One Slide

How LLMs Work

Tokens: unit of text (~¾ word)

Training: data → patterns → fine-tune

Context window: short-term memory with a hard token limit

Why They Fail

Hallucinations: 3 causes — patterns, data gaps, bad prompts

No real-time info: knowledge cutoff

Bias + no true understanding

Before You Go ✍️

Exit Ticket — write your answers on a piece of paper

Answer all 3 in 5 minutes:

1. In one sentence: what is a context window and what happens when it's exceeded?

2. Name one cause of hallucinations and give a real-world example of why it's dangerous.

3. Why might an AI product work better for some groups of people than others?

📅 Next Week

You'll test a real AI tool, try to make it hallucinate, and explore what jobs exist in AI. Come ready to experiment.